We will use the BTS and IBTS data sets. In this case we downloaded the data for starry ray. These data sets contain the CPUE per length per haul and can be downloaded from datras.ices.dk.
First we read in the IBTS data set. From these data, the hauls in the North Sea are selected by keeping only those that are in (roundfish) areas 1-7.
IBTS <- read.csv(file.path(path,"CPUE per length per haul_2017-07-12 11_18_36.csv"), stringsAsFactors = F)
IBTS <- IBTS[IBTS$Area <= 7,]
The lengths are stored in mm and the CPUE is stored as number per hour. “Zero hauls” (hauls where no individuals are counted) are included.
IBTS$NoPerHaul <- round(IBTS$CPUE_number_per_hour*(IBTS$HaulDur/60))
IBTS <- within(IBTS, haulCode <- paste(Year, Survey, Quarter, Ship, HaulNo, ShootLat, ShootLong, sep="_"))
We have a similar but slightly different data set for BTS. Here, the “zero hauls” have length class NA and the number per haul is NA. These observations are set to zero, so that we have the same structure as in the IBTS data set.
BTS <- read.csv(file.path(path,"CPUE per length per Hour and Swept Area_2017-07-12 11_29_36.csv"), stringsAsFactors = F)
BTS[is.na(BTS$LngtClass),]$NoPerHaul <- 0
BTS[is.na(BTS$LngtClass),]$LngtClass <- 0
The BTS swept area estimate is rounded, and thus not very useful. We recalculate it: the distance and beam width are in m, so the surface is in m2, and we divide by 1e+06 to get km2.
# there are some negative distances. We make those NA, and infer their surface later
BTS[!is.na(BTS$Distance) & BTS$Distance < 0,]$Distance <- NA
#now calculate surface
BTS$surface <- (BTS$Distance * BTS$BeamWidth)/1e+06
# summary of the surface gives number of NAs, and whether there are negative values
summary(BTS$surface)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0044 0.0296 0.0312 0.0347 0.0336 0.0858 569
summary(BTS$HaulDur)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 30.00 30.00 34.39 30.00 60.00
#make haul code
BTS <- within(BTS, haulCode <- paste(Year, Survey, Quarter, Ship, HaulNo, ShootLat, ShootLong, sep="_"))
The BTS and IBTS data sets need to be combined into one, using rbind() to combine them by rows. The data sets can only be combined if they have the same columns, so the columns that are shared by the two data sets are found using intersect().
cols <- intersect(colnames(BTS), colnames(IBTS))
cpue <- rbind(BTS[,cols],IBTS[,cols])
#remove CPUE per hour that we do not need now that we have counts
cpue <- cpue[,!names(cpue)=="CPUE_number_per_hour"]
Now that the IBTS and BTS data sets are combined we want to make a set where we have counts for all hauls and all lengths. This means first making an identifier for each unique haul (based on the information we have for all the hauls). This identifier is used to make a “trawllist” where all the information for the hauls is stored together with its unique identifier.
Once the trawllist is made we use expand.grid() to make a combination of all hauls and length classes. This set is merged with our original data set.
trawllist <- cpue[!duplicated(cpue$haulCode),!names(cpue) %in% c("Species","AphiaID","NoPerHaul","Sex", "LngtClass")]
#expand grid
hxl <- expand.grid(haulCode=unique(cpue$haulCode),LngtClass=unique(cpue$LngtClass), stringsAsFactors = F)
full_cpue <- merge(hxl,cpue[names(cpue) %in% c("haulCode", "LngtClass","NoPerHaul")], all.x=T)
rm(hxl)
After merging all possible combinations with the data we now have NAs for those lengths and hauls where the catch is zero, so we set those to zero. This data set is subsequently merged with the trawllist so that we have all information together.
full_cpue[is.na(full_cpue$NoPerHaul),]$NoPerHaul <- 0
full_cpue <- merge(full_cpue,trawllist, all.x=T)
The records with lengths equal to zero (which indicated zero hauls in our original set) now need to be removed, because we have all the information we need (these hauls now have zero catch for the full length range observed in the survey).
#now remove zero lengths
full_cpue <- full_cpue[full_cpue$LngtClass > 0,]
In addition, there are some observations that are highly unlikely: for instance, there is a single observation of an individual of 100 cm (in 1977). This is highly suspicious because it is far larger than any other observation, and likely due to species misidentification. This can be seen in the histogram of length observations below. One could even consider removing the observations of individuals larger than 750 mm.
hist(full_cpue[full_cpue$NoPerHaul >0,]$LngtClass, breaks=100, xlab="Length (mm)", main="Observed lengths")
abline(v=990, col="red")
We remove this observation from the data. Note that when the length information is included in the analyses a further selection is made for the lengths.
#remove single unlikely large individual (of 1 m length)
full_cpue <- full_cpue[full_cpue$LngtClass < 990,]
head(full_cpue)
## haulCode LngtClass NoPerHaul Survey Quarter Ship Gear HaulNo Year HaulDur DayNight ShootLat
## 2 1966_NS-IBTS_1_AND_1_54.2333_7.45 10 0 NS-IBTS 1 AND H18 1 1966 60 D 54.2333
## 3 1966_NS-IBTS_1_AND_1_54.2333_7.45 20 0 NS-IBTS 1 AND H18 1 1966 60 D 54.2333
## 4 1966_NS-IBTS_1_AND_1_54.2333_7.45 30 0 NS-IBTS 1 AND H18 1 1966 60 D 54.2333
## 5 1966_NS-IBTS_1_AND_1_54.2333_7.45 50 0 NS-IBTS 1 AND H18 1 1966 60 D 54.2333
## 6 1966_NS-IBTS_1_AND_1_54.2333_7.45 60 0 NS-IBTS 1 AND H18 1 1966 60 D 54.2333
## 7 1966_NS-IBTS_1_AND_1_54.2333_7.45 70 0 NS-IBTS 1 AND H18 1 1966 60 D 54.2333
## ShootLong Depth
## 2 7.45 40
## 3 7.45 40
## 4 7.45 40
## 5 7.45 40
## 6 7.45 40
## 7 7.45 40
For our spatial correlation we will need a coordinate system in which distances are expressed in consistent units. Therefore we transform the latitudes and longitudes to UTM coordinates. These UTM coordinates are given in meters, so we divide them by 1000 to get kilometers.
UTM <- project(cbind(full_cpue$ShootLong, full_cpue$ShootLat), "+proj=utm +zone=31 +ellps=WGS84")
full_cpue$Xkm <- UTM[,1]/1000
full_cpue$Ykm <- UTM[,2]/1000
The INLA code does not like special characters in some of the variable names, like the hyphen in “NS-IBTS”. Therefore we rename the survey to NSIBTS.
full_cpue <- transform(full_cpue, Survey=gsub("-","",Survey))
We still have a lot of data (2077880 rows). Because these data are just here as an example, we will subset them and start our analysis from a given year. That year is stored in startyr.
startyr <- 2005
cpue_subset <- full_cpue[full_cpue$Year >= startyr,]
In this example we thus start from 2005. The earlier we start, the more information the analysis will give on the annual changes in the count data. However, including more data also means we require more memory to store the model, and wait longer for the model to run. If the inla model uses more memory than you have available, it will crash. Now we have 732000 rows left in cpue_subset. We start with a very simple analysis where we do not take length into account, and only analyse numbers per haul. Hence we aggregate the counts per haul.
cpue_subset <- aggregate(NoPerHaul ~ haulCode + Survey + Quarter + Ship + Gear + HaulNo + Year + HaulDur + ShootLong + ShootLat + Xkm + Ykm + Depth, data= cpue_subset,FUN="sum")
After the aggregation (now that the length information is removed) we have 9619 rows in the data set. To give a quick impression of the data we plot it in a map. The two surveys have different colors: red for the IBTS hauls and black for the BTS hauls. The size of the circles indicates the counts in each haul.
par(mfrow=c(1,2))
plot(cpue_subset$ShootLong,cpue_subset$ShootLat,
cex=0.1,
col=as.numeric(as.factor(cpue_subset$Survey)),
xlab="Longitude", ylab="Latitude", main="Location of all hauls" )
map("world",add=T)
plot(cpue_subset$ShootLong,cpue_subset$ShootLat,
cex=(cpue_subset$NoPerHaul)/10,
col=as.numeric(as.factor(cpue_subset$Survey)),
xlab="Longitude", ylab="Latitude", main="Counts per haul" )
map("world",add=T)
Clearly the IBTS hauls cover a larger part of the North Sea, but the large catches for both are located in the same areas.
We also plot a quick histogram of the counts, to give us an impression of the statistical distribution of the data.
par(mfrow=c(1,2))
hist(cpue_subset[cpue_subset$Survey == "BTS",]$NoPerHaul, 200, xlab= "Number per haul", main="Counts per haul BTS")
hist(cpue_subset[cpue_subset$Survey == "NSIBTS",]$NoPerHaul, 200, xlab= "Number per haul", main="Counts per haul IBTS")
The count data are clearly quite skewed, with lots of zeros and small values, and very few large counts. Because of this distribution, and because these are counts, a Poisson and a negative binomial distribution will be used in the inla model.
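The difference between the two candidate distributions can be illustrated with simulated data: for the same mean, the negative binomial allows a much larger variance (overdispersion), which often matches skewed count data like these better. The mean and size values below are arbitrary illustration numbers, not estimates from the survey data.

```r
# Illustration (not survey data): Poisson vs negative binomial counts
# with the same mean but different variances.
set.seed(1)
n  <- 1e5
mu <- 3                                   # common mean, arbitrary
pois <- rpois(n, lambda = mu)
nb   <- rnbinom(n, mu = mu, size = 0.7)   # small 'size' = strong overdispersion

# For a Poisson the variance equals the mean;
# for a negative binomial it is mu + mu^2/size.
c(mean(pois), var(pois))   # both close to 3
c(mean(nb),   var(nb))     # mean close to 3, variance much larger
```

With `size = 0.7` the theoretical negative binomial variance is 3 + 9/0.7, roughly five times the Poisson variance at the same mean.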
The UTM coordinates of the data set are combined into a Loc (location) dataset. That data set will be used later to generate spatial meshes for the data.
Loc <- cbind(cpue_subset$Xkm , cpue_subset$Ykm )
Next we need a mesh for the spatial data. Because we do not want our spatial correlations to pass over landmasses (e.g. Denmark) we first make a nonconvex hull of the data points using inla.nonconvex.hull(). This hull is used as a boundary for making a 2d mesh. The generation of the actual mesh is done using inla.mesh.2d(). That function takes several arguments, including “cutoff” and “max.edge”, which specify how fine the final mesh will be. Finer meshes can capture smaller-scale spatial correlations, but require more computing time in the inla model.
ConvHull <- inla.nonconvex.hull(Loc, convex=-0.02, resolution=90)
mesh1a <- inla.mesh.2d(boundary = ConvHull, max.edge=c(30))
As an alternative to using the nonconvex hull function to generate a boundary, we can also take the shapefile of the North Sea as the boundary of the mesh. This makes prediction easier later on, as we know no record can be found outside the North Sea area (give or take evolution). We first load the ICES shapefiles, which were downloaded from http://gis.ices.dk/sf/, merge the layers together, transform to UTM, convert the UTM coordinates from meters to kilometers, and finally create a mesh.
ICESareas <- readShapePoly(file.path(path,"ICES_Areas_20160601_dense"))
## Warning: use rgdal::readOGR or sf::st_read
NorthSea <- subset(ICESareas,SubArea==4)
NorthSea <- gUnionCascaded(NorthSea)
proj4string(NorthSea) <- c("+proj=longlat")
NorthSeaUTM <- spTransform(NorthSea,CRS("+proj=utm +zone=31"))
NS.border <- inla.sp2segment(NorthSeaUTM)
NS.border$loc <- NS.border$loc/1000
mesh1b <- inla.mesh.2d(boundary=NS.border, cutoff=3,max.edge=c(30))
The meshes can be plotted using the plot() function on the mesh object. Once the mesh is plotted, the locations of the samples (stored in the Loc object) can be overlayed using points(). Below, the two meshes are plotted side-by-side.
par(mfrow=c(1,2))
plot(mesh1a)
points(Loc, col = 1, pch = 16, cex = 0.3)
plot(mesh1b)
points(Loc, col = 1, pch = 16, cex = 0.3)
Once the 2d mesh is made we construct an observation/prediction weight matrix for the model. This is also called the “projector matrix”.
# 2. Define the weighting factors a_ik (also called the projector matrix).
A1 <- inla.spde.make.A(mesh = mesh1a, loc = Loc)
dim(A1)
## [1] 9619 1662
The first dimension of the projector matrix has the size of the number of observations (here 9619), and the second dimension of the projector matrix is the number of nodes in the mesh (here 1662).
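Conceptually, each row of the projector matrix holds the interpolation weights of the mesh triangle containing one observation, so multiplying the matrix by a vector of node values interpolates the spatial field to the haul locations. A minimal dense sketch with toy numbers (not the real mesh):

```r
# Toy projector matrix: 3 observations, 4 mesh nodes.
# Each row holds interpolation weights that sum to 1.
A_toy <- rbind(c(0.2, 0.5, 0.3, 0.0),
               c(0.0, 1.0, 0.0, 0.0),   # observation exactly on node 2
               c(0.1, 0.0, 0.4, 0.5))
w_nodes <- c(1, 2, 3, 4)                # field values at the mesh nodes

A_toy %*% w_nodes                        # field interpolated at the 3 hauls
rowSums(A_toy)                           # all equal to 1
```

In the real model A1 is a sparse matrix with 9619 rows (observations) and 1662 columns (mesh nodes), but the interpolation principle is the same.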
# 3. Define the spde
spde <- inla.spde2.matern(mesh1a)
w.st <- inla.spde.make.index('w', n.spde = spde$n.spde)
The stack allows INLA to build models with complex linear predictors. Here we have an SPDE model combined with covariate fixed effects and an intercept, at the n haul locations.
Before making the stack we need to convert all fixed effects that are factors in the INLA model.
cpue_subset$fYear <- as.factor(cpue_subset$Year)
cpue_subset$fSurvey <- as.factor(cpue_subset$Survey)
Note that in the code below the names in the model matrix should not contain any special characters! This is why we renamed the “NS-IBTS” survey to “NSIBTS”.
# 5. Make a stack. In this process we tell INLA at which points on the mesh we sampled the response variable and the covariates.
Xmatrix <- model.matrix(~ fYear + fSurvey + HaulDur, data = cpue_subset)
head(Xmatrix)
## (Intercept) fYear2006 fYear2007 fYear2008 fYear2009 fYear2010 fYear2011 fYear2012 fYear2013 fYear2014 fYear2015
## 1 1 0 0 0 0 0 0 0 0 0 1
## 2 1 0 0 0 0 0 0 0 0 1 0
## 3 1 0 0 0 0 0 0 1 0 0 0
## 4 1 0 0 0 0 0 0 0 0 1 0
## 5 1 0 0 0 0 0 0 0 0 0 1
## 6 1 0 0 1 0 0 0 0 0 0 0
## fYear2016 fYear2017 fSurveyNSIBTS HaulDur
## 1 0 0 0 30
## 2 0 0 0 30
## 3 0 0 1 30
## 4 0 0 0 30
## 5 0 0 0 30
## 6 0 0 0 30
This Xmatrix contains the model matrix with the fixed effects, including the intercept (The column for the intercept is named “(Intercept)”, and it is 1 for all observations). However, in the next step the intercept is removed from the model matrix. The intercept is then included when making the stack, and named “Intercept” (without brackets).
X <- as.data.frame(Xmatrix[,-1])
names(X) <- c(gsub("[:]",".",names(X)))
head(X)
## fYear2006 fYear2007 fYear2008 fYear2009 fYear2010 fYear2011 fYear2012 fYear2013 fYear2014 fYear2015 fYear2016
## 1 0 0 0 0 0 0 0 0 0 1 0
## 2 0 0 0 0 0 0 0 0 1 0 0
## 3 0 0 0 0 0 0 1 0 0 0 0
## 4 0 0 0 0 0 0 0 0 1 0 0
## 5 0 0 0 0 0 0 0 0 0 1 0
## 6 0 0 1 0 0 0 0 0 0 0 0
## fYear2017 fSurveyNSIBTS HaulDur
## 1 0 0 30
## 2 0 0 30
## 3 0 1 30
## 4 0 0 30
## 5 0 0 30
## 6 0 0 30
N <- nrow(cpue_subset)
Stack1 <- inla.stack(
tag = "Fit",
data = list(y = cpue_subset$NoPerHaul),
A = list(1,1, A1),
effects = list(
Intercept=rep(1,N),
X=X, #Covariates
w=w.st)) #Spatial field
The model formula used in the inla model is generated from the names of the model matrix, combined with the intercept term and the spatial correlation model (“f(w, model=spde)”).
Subsequently, two inla models are run, one assuming that the data are Poisson distributed, and another model assuming that the data are negative binomial distributed.
fsp <- parse(text=c("y ~ -1 + Intercept + ",paste(c(names(X)," f(w, model = spde)"),collapse =" + ")))
INLA:::inla.dynload.workaround()
I1p <- inla(eval(fsp), family = "poisson", data=inla.stack.data(Stack1),
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(A = inla.stack.A(Stack1)))
I1nb <- inla(eval(fsp), family = "nbinomial", data=inla.stack.data(Stack1),
control.compute = list(dic = TRUE, waic = TRUE, config=TRUE),
control.predictor = list(A = inla.stack.A(Stack1)))
Once the INLA models are run, a summary can be printed for the two models. This summary contains much of the relevant information for the models.
summary(I1p)
##
## Call:
## c("inla(formula = eval(fsp), family = \"poisson\", data = inla.stack.data(Stack1), ", " control.compute = list(dic = TRUE, waic = TRUE), control.predictor = list(A = inla.stack.A(Stack1)))" )
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.6041 96.5902 0.8910 98.0852
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept -2.1978 0.4127 -3.0399 -2.1876 -1.4127 -2.1683 0
## fYear2006 -0.1518 0.0414 -0.2332 -0.1517 -0.0705 -0.1517 0
## fYear2007 0.1409 0.0376 0.0671 0.1409 0.2146 0.1408 0
## fYear2008 -0.0901 0.0389 -0.1663 -0.0901 -0.0139 -0.0901 0
## fYear2009 -0.2386 0.0408 -0.3188 -0.2386 -0.1587 -0.2386 0
## fYear2010 -0.3350 0.0414 -0.4165 -0.3350 -0.2538 -0.3349 0
## fYear2011 -0.4101 0.0412 -0.4910 -0.4100 -0.3293 -0.4100 0
## fYear2012 -0.2396 0.0399 -0.3179 -0.2396 -0.1614 -0.2396 0
## fYear2013 -0.6220 0.0437 -0.7080 -0.6220 -0.5364 -0.6219 0
## fYear2014 -0.5847 0.0465 -0.6763 -0.5846 -0.4937 -0.5844 0
## fYear2015 -0.4515 0.0433 -0.5367 -0.4515 -0.3667 -0.4514 0
## fYear2016 -0.5482 0.0438 -0.6344 -0.5482 -0.4624 -0.5481 0
## fYear2017 -0.6246 0.0806 -0.7852 -0.6238 -0.4684 -0.6222 0
## fSurveyNSIBTS -1.2206 0.0258 -1.2713 -1.2206 -1.1702 -1.2206 0
## HaulDur 0.0310 0.0039 0.0235 0.0310 0.0387 0.0310 0
##
## Random effects:
## Name Model
## w SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Theta1 for w 1.119 0.060 1.002 1.119 1.238 1.118
## Theta2 for w -3.356 0.085 -3.525 -3.355 -3.190 -3.353
##
## Expected number of effective parameters(std dev): 618.12(11.40)
## Number of equivalent replicates : 15.56
##
## Deviance Information Criterion (DIC) ...: 24644.59
## Effective number of parameters .........: 601.04
##
## Watanabe-Akaike information criterion (WAIC) ...: 26392.51
## Effective number of parameters .................: 1953.64
##
## Marginal log-Likelihood: -13024.91
## Posterior marginals for linear predictor and fitted values computed
summary(I1nb)
##
## Call:
## c("inla(formula = eval(fsp), family = \"nbinomial\", data = inla.stack.data(Stack1), ", " control.compute = list(dic = TRUE, waic = TRUE, config = TRUE), ", " control.predictor = list(A = inla.stack.A(Stack1)))")
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.5113 58.0383 0.9622 59.5118
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept -2.1607 0.7571 -3.6939 -2.1516 -0.6780 -2.1341 0
## fYear2006 -0.1517 0.0974 -0.3428 -0.1517 0.0394 -0.1517 0
## fYear2007 0.0220 0.0957 -0.1658 0.0219 0.2096 0.0219 0
## fYear2008 -0.0601 0.0975 -0.2514 -0.0601 0.1313 -0.0602 0
## fYear2009 -0.4871 0.1003 -0.6841 -0.4871 -0.2903 -0.4870 0
## fYear2010 -0.6851 0.1013 -0.8842 -0.6851 -0.4864 -0.6850 0
## fYear2011 -0.6422 0.1000 -0.8387 -0.6422 -0.4462 -0.6421 0
## fYear2012 -0.4141 0.0985 -0.6075 -0.4141 -0.2208 -0.4141 0
## fYear2013 -0.7856 0.1021 -0.9862 -0.7856 -0.5855 -0.7855 0
## fYear2014 -0.5050 0.1045 -0.7101 -0.5050 -0.2998 -0.5050 0
## fYear2015 -0.3734 0.1028 -0.5753 -0.3734 -0.1716 -0.3735 0
## fYear2016 -0.4526 0.1025 -0.6538 -0.4527 -0.2514 -0.4527 0
## fYear2017 -0.3128 0.1435 -0.5942 -0.3129 -0.0312 -0.3131 0
## fSurveyNSIBTS -1.0224 0.0589 -1.1380 -1.0223 -0.9069 -1.0223 0
## HaulDur 0.0377 0.0071 0.0238 0.0377 0.0516 0.0377 0
##
## Random effects:
## Name Model
## w SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## size for the nbinomial observations (1/overdispersion) 0.7374 0.0287 0.6823 0.7369 0.7954 0.7361
## Theta1 for w 1.9904 0.0781 1.8385 1.9896 2.1457 1.9866
## Theta2 for w -4.1107 0.1281 -4.3690 -4.1081 -3.8646 -4.0988
##
## Expected number of effective parameters(std dev): 332.78(15.32)
## Number of equivalent replicates : 28.91
##
## Deviance Information Criterion (DIC) ...: 18869.74
## Effective number of parameters .........: 324.93
##
## Watanabe-Akaike information criterion (WAIC) ...: 18905.97
## Effective number of parameters .................: 321.18
##
## Marginal log-Likelihood: -9738.02
## Posterior marginals for linear predictor and fitted values computed
Next we plot the histograms of observations and fitted values for both models.
idx <- inla.stack.index(Stack1, tag= 'Fit')$data
par(mfrow=c(1,2))
hist(round(I1p$summary.fitted.values[idx,"mean"],0),300, xlim=c(0,50), ylim=c(0,4000), xlab="Counts", main="Poisson model")
obshist <- table(cpue_subset$NoPerHaul)
points(x=names(obshist), y=c(obshist), pch=20)
hist(round(I1nb$summary.fitted.values[idx,"mean"],0),200, xlim=c(0,50), ylim=c(0,4000), xlab="Counts", main="Negative binomial model")
obshist <- table(cpue_subset$NoPerHaul)
points(x=names(obshist), y=c(obshist), pch=20)
wproj <- inla.mesh.projector(mesh1a, xlim = range(Loc[,1]), ylim = range(Loc[,2]))
wm.pm100100 <- inla.mesh.project(wproj, I1nb$summary.random$w$mean)
wsd.pm100100 <- inla.mesh.project(wproj, I1nb$summary.random$w$sd)
grid <- expand.grid(x = wproj$x, y = wproj$y)
grid$zm <- as.vector(wm.pm100100)
grid$zsd <- as.vector(wsd.pm100100)
wld <- map('world', xlim=c(-5,15), ylim=c(47,62),plot=FALSE)
wld <- data.frame(lon=wld$x, lat=wld$y)
UTMmap <- project(cbind(wld$lon, wld$lat), "+proj=utm +zone=31 +ellps=WGS84")
UTMmapFinal <- data.frame("xm" =UTMmap[,1]/1e3, "ym" = UTMmap[,2]/1e3)
p1 <- levelplot(zm ~ x * y ,
data = grid,
scales = list(draw = TRUE),
xlab = list("Easting", cex = 1),
ylab = list("Northing", cex = 1),
main = list("Posterior mean spatial random fields", cex = 1),
col.regions=tim.colors(25, alpha = 1),
panel=function(x, y, z, subscripts,...){
panel.levelplot(x, y, z, subscripts,...)
grid.points(x = cpue_subset$Xkm,
y = cpue_subset$Ykm,
pch = 1,
size = unit(cpue_subset$NoPerHaul/15, "char"))
}) + xyplot(ym~ xm, UTMmapFinal, type='l', lty=1, lwd=0.5, col='black')
p2 <- levelplot(zsd ~ x * y,
data = grid,
scales = list(draw = TRUE),
xlab = list("Easting", cex = 1),
ylab = list("Northing", cex = 1),
main = list("Posterior sd spatial random fields", cex = 1),
col.regions=tim.colors(25, alpha = 1),
panel=function(x, y, z, subscripts,...){
panel.levelplot(x, y, z, subscripts,...)
}) + xyplot(ym~ xm, UTMmapFinal, type='l', lty=1, lwd=0.5, col='black')
grid.arrange(p1,p2, ncol=2)
What happens here?
tcoo <- rbind(c(0.3, 0.3), c(0.5, 0.5), c(0.7, 0.7))
inla.nonconvex.hull
dim(Ap <- inla.spde.make.A(mesh = mesh1, loc = tcoo))
## [1] 3 505
x0 <- c(0.5, 0.5, 0.5)
#To do a fully Bayesian analysis, we include the target locations on the estimation process by
#assigning NA for the response at these locations. Defining the prediction stack
stk.pred <- inla.stack(tag='pred', A=list(Ap, 1), data=list(y=NA), ## response as NA
effects=list(s=1:spde$n.spde, data.frame(x=x0, b0=1)))
#Fit the model again with the full stack
stk.full <- inla.stack(stk.e, stk.pred)
p.res <- inla(formula, data=inla.stack.data(stk.full), ## full stack
control.predictor=list(compute=TRUE, ## compute the predictor
A=inla.stack.A(stk.full))) ## using full stack data
#Get the prediction data index and collect the PMD to work with
pred.ind <- inla.stack.index(stk.full, tag = "pred")$data
ypost <- p.res$marginals.fitted.values[pred.ind]
#Visualize PMD for the linear predictor at target locations with the commands below
names(ypost) <- paste('y', seq_along(ypost), sep='_'); library(plyr)
xyplot(y~x | .id, ldply(ypost), panel='llines', xlab='y', ylab='Density')
#In INLA we have some functions to work with marginals distributions
apropos("marginal")
## [1] "inla.dmarginal" "inla.emarginal" "inla.hpdmarginal"
## [4] "inla.mmarginal" "inla.pmarginal" "inla.qmarginal"
## [7] "inla.rmarginal" "inla.smarginal" "inla.tmarginal"
inla.mmarginal(ypost[[1]]) ## mode
## [1] 1.697
inla.qmarginal(c(0.15, 0.7), ## quantiles
ypost[[1]])
## [1] 1.345 1.875
inla.pmarginal(inla.qmarginal(
0.3, ypost[[1]]), ypost[[1]])
## [1] 0.3
Owing to ontogenetic niche shifts we expect that the spatial distribution of small individuals differs from the spatial distribution of large individuals. Hence we want to include a length component in the spatial distribution of the counts.
First we need to make a new subset of the data that includes the length information (leaving out the aggregation step from our earlier analysis). We use the same starting year as in the previous analysis: 2005.
# make selection of fullset.
cpue_subset <- full_cpue[full_cpue$Year >= startyr,]
Let’s make 5 cm classes instead of 1 cm classes. This reduces the number of observations. We use round() to achieve this. Remember that the unit of the length measurements is mm.
Once we have 5 cm length classes, we use aggregate() to sum the numbers per haul within our new length bins. Because we need the other information in the data set as well, we include all variables in the aggregate.
cpue_subset$LngtClass <- round(cpue_subset$LngtClass/50)*50
cpue_subset <- aggregate(NoPerHaul~ LngtClass + haulCode + Survey + Quarter + Ship + Gear + HaulNo + Year + HaulDur + ShootLong +
ShootLat + Xkm + Ykm + Depth, data= cpue_subset,FUN="sum")
Below, we inspect the length range of observations. Only a couple of individuals longer than 600 mm are observed. Hence, we remove those length classes. If there are too many consecutive length classes with only zeros, the model will fail, giving a warning that one of the eigenvalues is negative.
aggregate(NoPerHaul~LngtClass, data=cpue_subset, FUN= "sum")
## LngtClass NoPerHaul
## 1 0 3
## 2 50 13
## 3 100 816
## 4 150 807
## 5 200 973
## 6 250 1390
## 7 300 2292
## 8 350 3047
## 9 400 3361
## 10 450 2122
## 11 500 370
## 12 550 28
## 13 600 4
## 14 650 2
## 15 700 0
## 16 750 0
## 17 800 0
## 18 850 0
cpue_subset <- cpue_subset[cpue_subset$LngtClass > 0,]
cpue_subset <- cpue_subset[cpue_subset$LngtClass < 650,]
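A quick way to spot length classes that contain only zeros (and would cause the eigenvalue warning mentioned above) is to aggregate the counts per class and look for zero totals. This is a sketch on synthetic data; only the column names mirror those in cpue_subset.

```r
# Synthetic illustration: find length classes whose total count is zero.
demo <- data.frame(LngtClass = rep(c(100, 200, 300, 400), each = 3),
                   NoPerHaul = c(1, 0, 2,  0, 0, 0,  5, 1, 0,  0, 0, 0))
totals <- aggregate(NoPerHaul ~ LngtClass, data = demo, FUN = sum)
totals$LngtClass[totals$NoPerHaul == 0]   # classes to consider removing
```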
Because we now have combinations of all hauls and several length classes, the subset of data (since 2005) is still very big: it has 115428 observations. In order to reduce the number of observations used in the INLA model, we take all non-zero observations but subsample the zero observations. The remaining zero observations will need to receive a higher weight in the final model. Note that this is not yet implemented.
cpue_subset_temp <- cpue_subset[cpue_subset$NoPerHaul > 0,]
cpue_subset_temp <-rbind(cpue_subset_temp, cpue_subset[sample(which(cpue_subset$NoPerHaul==0),40000),])
cpue_subset <- cpue_subset_temp
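One possible way to implement the weighting mentioned above (a sketch, not part of the original analysis): give every retained zero record a weight equal to the number of zeros in the full set divided by the number of zeros kept, so the zeros retain their original total influence. The data and variable names below are illustrative.

```r
# Illustrative weighting for subsampled zeros (synthetic data, not the survey).
set.seed(2)
counts <- c(rep(0, 1000), rpois(100, 5) + 1)  # 1000 zeros, 100 positive counts
keep_n <- 200                                  # number of zeros to keep

zero_idx <- which(counts == 0)
kept <- c(which(counts > 0), sample(zero_idx, keep_n))

w <- rep(1, length(kept))                         # positives get weight 1
w[counts[kept] == 0] <- length(zero_idx)/keep_n   # each kept zero stands for 5 zeros

sum(w[counts[kept] == 0])  # equals the original number of zeros: 1000
```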
As before, The UTM coordinates of the observations are combined into a Loc (location) dataset. That dataset is used to create a mesh in the next step.
Loc <- cbind(cpue_subset$Xkm , cpue_subset$Ykm )
Next we need a mesh for the spatial data. Because we do not want our spatial correlations to pass over landmasses (e.g. Denmark) we first make a nonconvex hull of the data points using inla.nonconvex.hull(). This hull is used as a boundary for making a 2d mesh.
ConvHull <- inla.nonconvex.hull(Loc, convex=-0.02, resolution=90)
mesh2a <- inla.mesh.2d(boundary = ConvHull, max.edge=c(30))
plot(mesh2a)
points(Loc, col = 1, pch = 16, cex = 0.5)
Next we inspect the length range in the data by making a histogram of length observations. This histogram can be used to define the locations of a number of “knots” along the length range that we will later use for our analysis. More knots means a longer computing time (but a higher degree of flexibility in the length component of the spatial correlation).
The length classes above 600 mm were already removed above, so we define the knots within the remaining length range.
hist(cpue_subset$LngtClass,200, main="", xlab="Length class (mm)")
knots <- seq(from = 50, to = 550, by = 100)
knots
## [1] 50 150 250 350 450 550
Using the 6 knots as locations, we make a 1 dimensional mesh.
# One-dimensional mesh for length class. See the time series module
mesh.t <- inla.mesh.1d(loc = knots)
In this mesh.t object, the locations can be checked using mesh.t$loc and the number of locations using mesh.t$n. The code below confirms that there are 6 locations, and that those are at the values of the knots object.
mesh.t$n
## [1] 6
mesh.t$loc
## [1] 50 150 250 350 450 550
Now that there is a 1-dimensional mesh for the lengths, we use it to construct an observation/prediction weight matrix (“projector matrix”) based on the spatial mesh that we created earlier (mesh2a) and our new mesh for the lengths. The lengths are used in the “group model”. The new projector matrix is named A2 to distinguish it from the projector matrix of the previous model.
# 2. Define the weighting factors a_ik (projector matrix).
NGroups <- length(knots)
A2 <- inla.spde.make.A(mesh = mesh2a,
loc = Loc,
group = cpue_subset$LngtClass,
group.mesh = mesh.t)
dim(A2)
## [1] 115428 9972
# 3. Define the spde
spde <- inla.spde2.matern(mesh2a)
We need an inla.spde model object for a Matern model; it was created above from the new mesh and named “spde”. We use it to make a list of named index vectors for the SPDE model. Note that the command for making the list of index vectors now includes an argument for the groups.
w.st <- inla.spde.make.index('w',
n.spde = spde$n.spde,
n.group = NGroups)
Before making the stack we need to convert all fixed effects that are factors in the INLA model.
cpue_subset$fYear <- as.factor(cpue_subset$Year)
cpue_subset$fSurvey <- as.factor(cpue_subset$Survey)
Next we make a new stack. For this we need a model matrix. Although the fixed effects are the same as in the previous model, we still need to make a new model matrix because the data now include the length structure.
# 5. Make a stack.
Xmatrix <- model.matrix(~ fYear + fSurvey + HaulDur, data = cpue_subset)
head(Xmatrix)
## (Intercept) fYear2006 fYear2007 fYear2008 fYear2009 fYear2010 fYear2011 fYear2012 fYear2013 fYear2014 fYear2015
## 2 1 0 0 0 0 0 0 0 0 0 1
## 3 1 0 0 0 0 0 0 0 0 0 1
## 4 1 0 0 0 0 0 0 0 0 0 1
## 5 1 0 0 0 0 0 0 0 0 0 1
## 6 1 0 0 0 0 0 0 0 0 0 1
## 7 1 0 0 0 0 0 0 0 0 0 1
## fYear2016 fYear2017 fSurveyNSIBTS HaulDur
## 2 0 0 0 30
## 3 0 0 0 30
## 4 0 0 0 30
## 5 0 0 0 30
## 6 0 0 0 30
## 7 0 0 0 30
This Xmatrix contains the model matrix with the fixed effects, including the intercept (The column for the intercept is named “(Intercept)”, and it is 1 for all observations). However, in the next step the intercept is removed from the model matrix. The intercept is then included when making the stack, and named “Intercept” (without brackets).
X <- as.data.frame(Xmatrix[,-1])
names(X) <- c(gsub("[:]",".",names(X)))
head(X)
## fYear2006 fYear2007 fYear2008 fYear2009 fYear2010 fYear2011 fYear2012 fYear2013 fYear2014 fYear2015 fYear2016
## 2 0 0 0 0 0 0 0 0 0 1 0
## 3 0 0 0 0 0 0 0 0 0 1 0
## 4 0 0 0 0 0 0 0 0 0 1 0
## 5 0 0 0 0 0 0 0 0 0 1 0
## 6 0 0 0 0 0 0 0 0 0 1 0
## 7 0 0 0 0 0 0 0 0 0 1 0
## fYear2017 fSurveyNSIBTS HaulDur
## 2 0 0 30
## 3 0 0 30
## 4 0 0 30
## 5 0 0 30
## 6 0 0 30
## 7 0 0 30
N <- nrow(cpue_subset)
Stack2 <- inla.stack(
tag = "Fit",
data = list(y = cpue_subset$NoPerHaul),
A = list(1,1, A2),
effects = list(
Intercept = rep(1, N),
X = as.data.frame(X), # Covariates
w.st)) # Spatial-temp field
fsp <- parse(text=c("y ~ -1 + Intercept + ",
paste(c(names(X)," f(w, model = spde, group = w.group, control.group = list(model = 'ar1'))"),collapse =" + ")))
I2nb <- inla(eval(fsp), family = "nbinomial",
data=inla.stack.data(Stack2),
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(A = inla.stack.A(Stack2)))
summary(I2nb)
We still have the “UTMmapFinal” object for creating the maps from the previous analysis.
w <- I2nb$summary.random$w$mean
# length of w is mesh$n * NGroups
wproj <- inla.mesh.projector(mesh2a, xlim = range(Loc[,1]), ylim = range(Loc[,2]))
grid <- expand.grid(length=knots, x = wproj$x, y = wproj$y,zm=NA)
for (i in knots){
w.pm100100 <- inla.mesh.project(wproj,w[w.st$w.group==which(i==knots)])
grid[grid$length==i,]$zm <- as.vector(w.pm100100)
}
Next we print the grid, which is now estimated at each knot. The observed counts for the length class at each knot are included in each panel.
print(levelplot(zm ~ x * y |length,
data = grid,
scales = list(draw = TRUE),
xlab = list("Easting", cex = 1),
ylab = list("Northing", cex = 1),
main = list("Posterior mean spatial random fields", cex = 1),
col.regions=tim.colors(25, alpha = 1),
panel=function(x, y, z, subscripts,...){
panel.levelplot(x, y, z, subscripts,...)
grid.points(x = cpue_subset[cpue_subset$LngtClass == grid[subscripts[1],]$length,]$Xkm,
y = cpue_subset[cpue_subset$LngtClass == grid[subscripts[1],]$length,]$Ykm,
pch = 1,
size = unit(cpue_subset[cpue_subset$LngtClass ==grid[subscripts[1],]$length,]$NoPerHaul/15, "char"))
}) + xyplot(ym~ xm, UTMmapFinal, type='l', lty=1, lwd=0.5, col='black'))
# 5. Make a stack.
Xmatrix <- model.matrix(~ fYear + fSurvey + HaulDur, data = cpue_subset)
head(Xmatrix)
## (Intercept) fYear2006 fYear2007 fYear2008 fYear2009 fYear2010 fYear2011 fYear2012 fYear2013 fYear2014 fYear2015
## 2 1 0 0 0 0 0 0 0 0 0 1
## 3 1 0 0 0 0 0 0 0 0 0 1
## 4 1 0 0 0 0 0 0 0 0 0 1
## 5 1 0 0 0 0 0 0 0 0 0 1
## 6 1 0 0 0 0 0 0 0 0 0 1
## 7 1 0 0 0 0 0 0 0 0 0 1
## fYear2016 fYear2017 fSurveyNSIBTS HaulDur
## 2 0 0 0 30
## 3 0 0 0 30
## 4 0 0 0 30
## 5 0 0 0 30
## 6 0 0 0 30
## 7 0 0 0 30
X <- as.data.frame(Xmatrix[,-1])
names(X) <- c(gsub("[:]",".",names(X)))
head(X)
## fYear2006 fYear2007 fYear2008 fYear2009 fYear2010 fYear2011 fYear2012 fYear2013 fYear2014 fYear2015 fYear2016
## 2 0 0 0 0 0 0 0 0 0 1 0
## 3 0 0 0 0 0 0 0 0 0 1 0
## 4 0 0 0 0 0 0 0 0 0 1 0
## 5 0 0 0 0 0 0 0 0 0 1 0
## 6 0 0 0 0 0 0 0 0 0 1 0
## 7 0 0 0 0 0 0 0 0 0 1 0
## fYear2017 fSurveyNSIBTS HaulDur
## 2 0 0 30
## 3 0 0 30
## 4 0 0 30
## 5 0 0 30
## 6 0 0 30
## 7 0 0 30
N <- nrow(cpue_subset)
Stack3 <- inla.stack(
tag = "Fit",
data = list(y = cpue_subset$NoPerHaul),
A = list(1,1, A2),
effects = list(
Intercept = rep(1, N),
X = as.data.frame(X), # Covariates
w.st)) # Spatial-temp field
fsp <- parse(text = c("y ~ -1 + Intercept + ",
             paste(c(names(X),
                     "f(w, model = spde, group = w.group, control.group = list(model = 'ar1'))",
                     "f(LngtClass, model = 'rw2', group = fSurvey)"),
                   collapse = " + ")))
I3nb <- inla(eval(fsp), family = "nbinomial",
data=inla.stack.data(Stack3),
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(A = inla.stack.A(Stack3)))
summary(I3nb)
We still have the IBTS data available in the cpue_IBTS object, but that does not provide the swept area. For this, we need to read in the HH records of the IBTS, which are available in the IBTS exchange files. The swept areas can be calculated from the travelled distance and the door spread or wing spread. Missing values for these variables are coded as -9; in the R code below these are turned into NAs.
IBTSHH <- read.csv(file.path(path,"Exchange Data_2017-07-14 09_26_25.csv"), stringsAsFactors = F)
IBTSHH <- within(IBTSHH, haulCode <- paste(Year, Survey, Quarter, Ship, HaulNo, ShootLat, ShootLong, sep="_"))
IBTSHH[IBTSHH$DoorSpread == -9,]$DoorSpread <- NA
IBTSHH[IBTSHH$Distance == -9,]$Distance <- NA
IBTSHH[IBTSHH$WingSpread == -9,]$WingSpread <- NA
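The -9 sentinel recoding can be sketched on a toy vector (made-up numbers). Note that assigning through a logical index, as below, also works when no element matches, unlike the `data[data$x == -9,]$x` form used above, which errors on zero matching rows.

```r
# Recode the -9 sentinel to NA: -9 is a missing-value code, not a value.
spread <- c(25, -9, 31, -9)
spread[spread == -9] <- NA
spread   # 25 NA 31 NA
```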
Next, we plot a histogram of haul duration and a scatterplot of door spread versus wing spread to get a first look at the data. Clearly, haul duration is stored in minutes, with most of the hauls being either 30 minutes or 1 hour. Hauls longer than 80 minutes are removed, because these contain a large proportion of outliers with respect to e.g. distance. This step removes 36 of the 27703 hauls.
The wing spread and door spread are stored in meters, with wing spreads in the range 7–50 m and door spreads in the range 1–179 m. There is one clear outlier.
par(mfrow=c(1,2))
hist(IBTSHH$HaulDur,120, xlim=c(0,120), main= "", xlab="Haul duration (minutes)")
abline(v=80, lty=2)
IBTSHH <- IBTSHH[IBTSHH$HaulDur <= 80,]
plot(IBTSHH$WingSpread,IBTSHH$DoorSpread, xlab="Wingspread (m)", ylab= "Doorspread (m)")
abline(a=0,b=1, lty=2)
A surface is calculated by multiplying wing spread by distance. Both are in metres. The resulting surface is thus in m2. This is converted to km2 by dividing by 1e+06.
IBTSHH$surface <- (IBTSHH$Distance * IBTSHH$WingSpread)/1e+06
Next, only relevant variables are selected.
IBTSHH <- IBTSHH[names(IBTSHH) %in% c("haulCode", "surface", "HaulDur", "DoorSpread", "Distance", "WingSpread")]
Plotting the surface against haul duration, and fitting a simple linear model (without intercept), we hope to find a relationship.
plot(x=IBTSHH$HaulDur,y=IBTSHH$surface, pch=20, main= "IBTS", xlab= "Haul duration ( minutes)", ylab="Surface (km2)", xlim=c(0,80))
linmod <- lm(surface~ -1 + HaulDur, data=IBTSHH)
summary(linmod)
##
## Call:
## lm(formula = surface ~ -1 + HaulDur, data = IBTSHH)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.041650 -0.006689 -0.000587 0.005138 0.116422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## HaulDur 2.293e-03 4.171e-06 549.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01043 on 6494 degrees of freedom
## (21172 observations deleted due to missingness)
## Multiple R-squared: 0.979, Adjusted R-squared: 0.979
## F-statistic: 3.022e+05 on 1 and 6494 DF, p-value: < 2.2e-16
abline(linmod, lty=2)
Indeed, there is a good relationship: each minute of haul duration adds about 0.0023 km2 (the slope coef(linmod)) of haul surface for the IBTS. This relationship is used to calculate the surface of hauls where only the haul duration is known.
IBTSHH[is.na(IBTSHH$surface),]$surface <- as.numeric(predict(linmod,newdata=IBTSHH))[is.na(IBTSHH$surface)]
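The imputation pattern above can be sketched on made-up numbers: fit a through-the-origin lm, then fill in the NAs of the response via predict().

```r
# Toy sketch (hypothetical durations and surfaces) of the imputation:
toy <- data.frame(dur  = c(10, 20, 30, 40, 30),
                  surf = c(0.01, 0.02, 0.03, NA, NA))
fit  <- lm(surf ~ -1 + dur, data = toy)        # slope-only model; NA rows dropped
miss <- is.na(toy$surf)
toy$surf[miss] <- predict(fit, newdata = toy)[miss]
toy$surf                                       # NAs replaced by dur * slope
```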
Merge the haul information into the full CPUE set, after removing haul duration from the haul data (because we already have it in the CPUE set).
IBTSHH$HaulDur <- NULL
IBTS <- merge(IBTS, IBTSHH,by= "haulCode", all.x=T, all.y=F)
After this merge, nrow(IBTS[is.na(IBTS$surface),]) observations in the IBTS data set do not have a surface estimate, because their haul duration was larger than the threshold used in our calculations.
We use the same procedure of inferring haul surface from haul duration for the missing surfaces of the BTS.
plot(x=BTS$HaulDur,y=BTS$surface, pch=20, main= "BTS", xlab="Haul duration (minutes)", ylab="Surface (km2)", xlim=c(0,80))
linmod <- lm(surface~ -1 + HaulDur, data=BTS)
summary(linmod)
##
## Call:
## lm(formula = surface ~ -1 + HaulDur, data = BTS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0201602 -0.0008962 0.0005198 0.0018958 0.0247837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## HaulDur 1.018e-03 9.201e-07 1106 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003752 on 12741 degrees of freedom
## (569 observations deleted due to missingness)
## Multiple R-squared: 0.9897, Adjusted R-squared: 0.9897
## F-statistic: 1.223e+06 on 1 and 12741 DF, p-value: < 2.2e-16
abline(linmod, lty=2)
BTS[is.na(BTS$surface),]$surface <- as.numeric(predict(linmod,newdata=BTS))[is.na(BTS$surface)]
Indeed, there is a good relationship: each minute of haul duration adds about 0.0010 km2 (the slope coef(linmod)) of haul surface for the BTS. This relationship is used to calculate the surface of hauls where only the haul duration is known.
Next we combine the two sets into one.
cols <- intersect(colnames(BTS), colnames(IBTS))
cpue <- rbind(BTS[,cols],IBTS[,cols])
Now that the IBTS and BTS data sets are combined, we want to make a set in which we have counts for all hauls and all lengths. This means first making an identifier for each unique haul (based on the information we have for all hauls). This identifier is used to make a “trawllist” in which all the information for the hauls is stored.
Once the trawllist is made, we use expand.grid() to make a combination of all hauls and length classes. This set is merged with our original data set.
trawllist <- cpue[!duplicated(cpue$haulCode),!names(cpue) %in% c("Species","AphiaID","NoPerHaul","Sex", "LngtClass")]
#expand grid
hxl <- expand.grid(haulCode=unique(cpue$haulCode),LngtClass=unique(cpue$LngtClass), stringsAsFactors = F)
full_cpue <- merge(hxl,cpue[names(cpue) %in% c("haulCode", "LngtClass","NoPerHaul")], all.x=T)
rm(hxl)
After merging all possible combinations with the data, we now have NAs for those lengths and hauls where the catch is zero, so we set those to zero. This data set is subsequently merged with the trawllist so that all information is combined.
full_cpue[is.na(full_cpue$NoPerHaul),]$NoPerHaul <- 0
full_cpue <- merge(full_cpue,trawllist, all.x=T)
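The haul × length completion above can be illustrated with made-up hauls: expand.grid() builds every haul/length combination, merge() left-joins the observed counts, and unmatched combinations become zero catches.

```r
# Minimal sketch with hypothetical haul codes and length classes:
obs <- data.frame(haulCode  = c("h1", "h1", "h2"),
                  LngtClass = c(100, 150, 100),
                  NoPerHaul = c(5, 2, 7),
                  stringsAsFactors = FALSE)
hxl <- expand.grid(haulCode  = unique(obs$haulCode),
                   LngtClass = unique(obs$LngtClass),
                   stringsAsFactors = FALSE)
full <- merge(hxl, obs, all.x = TRUE)        # left join: h2/150 gets NA
full$NoPerHaul[is.na(full$NoPerHaul)] <- 0   # zero catch, not missing
full
```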
The records with length equal to zero (which indicated zero hauls in our original set) can now be removed, because we have all the information we need (these hauls now have zero catch for the full length range observed in the survey).
In addition, there are some observations that are highly unlikely. For instance, there is a single observation of an individual of 100 cm (in 1977, see one of the sections above). This is suspicious, and we remove it from the data.
#now remove zero lengths
full_cpue <- full_cpue[full_cpue$LngtClass> 0,]
full_cpue <- full_cpue[full_cpue$LngtClass< 990,]
For our spatial correlation we need a planar (Euclidean) coordinate system. Therefore we transform the latitudes and longitudes to UTM coordinates. These UTM coordinates are given in meters, so we divide them by 1000 to get kilometers.
UTM <- project(cbind(full_cpue$ShootLong, full_cpue$ShootLat), "+proj=utm +zone=31 +ellps=WGS84")
full_cpue$Xkm <- UTM[,1]/1000
full_cpue$Ykm <- UTM[,2]/1000
The INLA code does not like special characters in some of the variable names, like the hyphen in “NS-IBTS”. Therefore we rename the survey to NSIBTS.
full_cpue <- transform(full_cpue, Survey=gsub("-","",Survey))
We still have a lot of data (2077880 rows). Because these data serve only as an example, we will again start from startyr (2005).
cpue_subset <- full_cpue[full_cpue$Year >= startyr,]
The earlier we start, the more information the analysis will provide on the annual changes in the count data. However, including more data also means that more memory is needed to store the model, and that the model takes longer to run. If the inla model needs more memory than is available, it will crash. We now have 732000 rows left in cpue_subset. We start with a very simple analysis in which we do not take length into account and only analyse numbers per haul. Hence we aggregate the counts per haul. Because we now want to include the surface rather than the duration, we include the surface in the aggregation.
cpue_subset <- aggregate(NoPerHaul ~ haulCode + Survey + Quarter + Ship + Gear + HaulNo + Year + HaulDur + surface + ShootLong + ShootLat + Xkm + Ykm + Depth, data= cpue_subset,FUN="sum")
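The aggregation step can be sketched with the formula interface of aggregate(), here on toy data with hypothetical haul codes: counts are summed within each haul, leaving one row per haul.

```r
# Toy sketch: sum counts over length classes within each haul.
toy <- data.frame(haulCode  = c("h1", "h1", "h2"),
                  Year      = c(2005, 2005, 2006),
                  NoPerHaul = c(3, 4, 5))
agg <- aggregate(NoPerHaul ~ haulCode + Year, data = toy, FUN = "sum")
agg   # one row per haul, counts summed
```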
We do not repeat the exploratory data analyses here. As a reminder of the size of the data set: the length aggregation has reduced it to 9619 rows. This is the number of unique hauls in our data, which now spans the years 2005 to 2017.
The UTM coordinates of the data set are combined into a Loc (location) dataset. That data set will be used later to generate spatial meshes for the data.
Loc <- cbind(cpue_subset$Xkm, cpue_subset$Ykm )
Next we need a mesh for the spatial data. Because we do not want our spatial correlations to pass over landmasses (e.g. Denmark), we first make a nonconvex hull of the data points using inla.nonconvex.hull(). This hull is used as a boundary for making a 2d mesh. The actual mesh is generated using inla.mesh.2d(). That function takes several arguments, including “cutoff” and “max.edge”, which specify how fine the final mesh will be. Finer meshes can capture smaller-scale spatial correlations, but require more computing time in the inla model.
par(mfrow=c(1,1))
ConvHull <- inla.nonconvex.hull(Loc, convex=-0.02, resolution=90)
mesh1a <- inla.mesh.2d(boundary = ConvHull, max.edge=c(30))
plot(mesh1a)
points(Loc, col = 1, pch = 16, cex = 0.3)
The alternative mesh construction from the first examples can of course also be used.
Once the 2d mesh is made, we construct an observation/prediction weight matrix for the model, also called the “projector matrix”.
# 2. Define the weighting factors a_ik (also called the projector matrix).
A1 <- inla.spde.make.A(mesh = mesh1a, loc = Loc)
dim(A1)
## [1] 9619 1662
The first dimension of the projector matrix has the size of the number of observations (here 9619), and the second dimension of the projector matrix is the number of nodes in the mesh (here 1662).
# 3. Define the spde
spde <- inla.spde2.matern(mesh1a)
w.st <- inla.spde.make.index('w', n.spde = spde$n.spde)
The stack allows INLA to build models with complex linear predictors. Here we have an SPDE model combined with covariate fixed effects and an intercept for the N hauls.
Before making the stack we need to convert to factors all fixed effects that enter the INLA model as factors.
cpue_subset$fYear <- as.factor(cpue_subset$Year)
cpue_subset$fSurvey <- as.factor(cpue_subset$Survey)
Because of the log link of the Poisson and negative binomial models that we will be using, we log-transform the surfaces so that they enter the linear predictor linearly.
cpue_subset$lsurface <- log(cpue_subset$surface)
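The reason for the log transform can be checked on the predictor scale with made-up numbers: with a log link, the linear predictor gains beta * log(surface), so a coefficient of 1 on lsurface makes the expected catch proportional to the swept surface.

```r
# Toy check: doubling the surface doubles the expected count when beta = 1.
beta <- 1
eta1 <- 0.5 + beta * log(0.03)  # haul sweeping 0.03 km2
eta2 <- 0.5 + beta * log(0.06)  # twice the surface
exp(eta2) / exp(eta1)           # ratio of expected counts: 2
```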
Note that in code below the names in the model matrix should not contain any special characters! This is why we renamed the “NS-IBTS” survey to “NSIBTS”
# 5. Make a stack. In this process we tell INLA at which points on the mesh we sampled the response variable and the covariates.
Xmatrix <- model.matrix(~ fYear + fSurvey + lsurface, data = cpue_subset)
This Xmatrix contains the model matrix with the fixed effects, including the intercept (The column for the intercept is named “(Intercept)”, and it is 1 for all observations). However, in the next step the intercept is removed from the model matrix. The intercept is then included when making the stack, and named “Intercept” (without brackets).
X <- as.data.frame(Xmatrix[,-1])
names(X) <- c(gsub("[:]",".",names(X)))
head(X)
## fYear2006 fYear2007 fYear2008 fYear2009 fYear2010 fYear2011 fYear2012 fYear2013 fYear2014 fYear2015 fYear2016
## 1 0 0 0 0 0 0 0 0 0 1 0
## 2 0 0 0 0 0 0 0 0 1 0 0
## 3 0 0 0 0 0 0 1 0 0 0 0
## 4 0 0 0 0 0 0 0 0 1 0 0
## 5 0 0 0 0 0 0 0 0 0 1 0
## 6 0 0 1 0 0 0 0 0 0 0 0
## fYear2017 fSurveyNSIBTS lsurface
## 1 0 0 -3.489106
## 2 0 0 -3.479721
## 3 0 1 -3.013998
## 4 0 0 -3.402799
## 5 0 0 -3.714813
## 6 0 0 -3.518900
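The model-matrix construction can be sketched with a tiny data frame (hypothetical factor levels): the "(Intercept)" column of ones is the one we drop, and the factor expands into 0/1 dummy columns.

```r
# Toy model.matrix: intercept column plus dummy coding of a factor.
d  <- data.frame(fYear    = factor(c(2005, 2006, 2006)),
                 lsurface = c(-3.5, -3.4, -3.0))
Xm <- model.matrix(~ fYear + lsurface, data = d)
colnames(Xm)   # "(Intercept)" "fYear2006" "lsurface"
Xm[, -1]       # model matrix without the intercept column
```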
N <- nrow(cpue_subset)
Stack1 <- inla.stack(
tag = "Fit",
data = list(y = cpue_subset$NoPerHaul),
A = list(1,1, A1),
effects = list(
Intercept=rep(1,N),
X=X, #Covariates
w=w.st)) #Spatial field
Note that N is the number of rows (9619) in the data set, and thus equal to the first dimension of A1.
The model formula used in the inla model is generated from the names of the model matrix, combined with the intercept term and the spatial correlation model (“f(w, model=spde)”).
Subsequently, two inla models are run, one assuming that the data are Poisson distributed, and another model assuming that the data are negative binomial distributed.
fsp <- parse(text=c("y ~ -1 + Intercept + ",paste(c(names(X)," f(w, model = spde)"),collapse =" + ")))
INLA:::inla.dynload.workaround()
I1p <- inla(eval(fsp), family = "poisson", data=inla.stack.data(Stack1),
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(A = inla.stack.A(Stack1)))
I1nb <- inla(eval(fsp), family = "nbinomial", data=inla.stack.data(Stack1),
control.compute = list(dic = TRUE, waic = TRUE, config=TRUE),
control.predictor = list(A = inla.stack.A(Stack1)))
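The parse()/paste() construction above builds the formula as a text string. Base R's reformulate() is an alternative way to build the same formula from term names; a sketch with made-up term names:

```r
# Two equivalent ways to build a formula from character term names.
Xnames <- c("fYear2006", "fSurveyNSIBTS", "lsurface")
f1 <- eval(parse(text = paste("y ~ -1 + Intercept +",
                              paste(c(Xnames, "f(w, model = spde)"),
                                    collapse = " + "))))
f2 <- reformulate(c("Intercept", Xnames, "f(w, model = spde)"),
                  response = "y", intercept = FALSE)
f1
```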
Once the INLA models are run, a summary can be printed for the two models. This summary contains much of the relevant information for the models.
summary(I1p)
##
## Call:
## c("inla(formula = eval(fsp), family = \"poisson\", data = inla.stack.data(Stack1), ", " control.compute = list(dic = TRUE, waic = TRUE), control.predictor = list(A = inla.stack.A(Stack1)))" )
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.6580 251.3026 1.3452 253.3058
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept 1.8990 0.4612 0.9680 1.9070 2.7850 1.9221 0
## fYear2006 -0.1757 0.0413 -0.2570 -0.1757 -0.0947 -0.1756 0
## fYear2007 0.1491 0.0375 0.0755 0.1491 0.2227 0.1491 0
## fYear2008 -0.0842 0.0388 -0.1603 -0.0842 -0.0082 -0.0842 0
## fYear2009 -0.2297 0.0407 -0.3097 -0.2297 -0.1499 -0.2296 0
## fYear2010 -0.3325 0.0413 -0.4138 -0.3325 -0.2515 -0.3324 0
## fYear2011 -0.4074 0.0411 -0.4882 -0.4074 -0.3269 -0.4074 0
## fYear2012 -0.2477 0.0398 -0.3258 -0.2477 -0.1697 -0.2476 0
## fYear2013 -0.6227 0.0437 -0.7086 -0.6227 -0.5372 -0.6225 0
## fYear2014 -0.6225 0.0465 -0.7141 -0.6224 -0.5314 -0.6222 0
## fYear2015 -0.4575 0.0428 -0.5416 -0.4574 -0.3736 -0.4573 0
## fYear2016 -0.6173 0.0434 -0.7026 -0.6172 -0.5324 -0.6171 0
## fYear2017 -0.6636 0.0805 -0.8239 -0.6628 -0.5077 -0.6612 0
## fSurveyNSIBTS -1.9421 0.0572 -2.0549 -1.9419 -1.8302 -1.9416 0
## lsurface 0.8952 0.0650 0.7682 0.8950 1.0233 0.8946 0
##
## Random effects:
## Name Model
## w SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Theta1 for w 1.157 0.0601 1.04 1.157 1.276 1.156
## Theta2 for w -3.380 0.0857 -3.55 -3.380 -3.213 -3.378
##
## Expected number of effective parameters(std dev): 610.08(11.50)
## Number of equivalent replicates : 15.77
##
## Deviance Information Criterion (DIC) ...: 24532.38
## Effective number of parameters .........: 593.59
##
## Watanabe-Akaike information criterion (WAIC) ...: 26219.41
## Effective number of parameters .................: 1907.07
##
## Marginal log-Likelihood: -12953.70
## Posterior marginals for linear predictor and fitted values computed
summary(I1nb)
##
## Call:
## c("inla(formula = eval(fsp), family = \"nbinomial\", data = inla.stack.data(Stack1), ", " control.compute = list(dic = TRUE, waic = TRUE, config = TRUE), ", " control.predictor = list(A = inla.stack.A(Stack1)))")
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.5855 126.6019 1.5615 128.7489
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept 2.4213 0.8564 0.7078 2.4270 4.1013 2.4382 0
## fYear2006 -0.1696 0.0972 -0.3605 -0.1696 0.0212 -0.1696 0
## fYear2007 0.0462 0.0954 -0.1411 0.0462 0.2335 0.0462 0
## fYear2008 -0.0464 0.0973 -0.2374 -0.0465 0.1446 -0.0465 0
## fYear2009 -0.4770 0.1001 -0.6736 -0.4770 -0.2808 -0.4770 0
## fYear2010 -0.6715 0.1012 -0.8702 -0.6714 -0.4731 -0.6714 0
## fYear2011 -0.6399 0.0999 -0.8360 -0.6398 -0.4441 -0.6398 0
## fYear2012 -0.4117 0.0983 -0.6047 -0.4117 -0.2189 -0.4117 0
## fYear2013 -0.7700 0.1019 -0.9702 -0.7700 -0.5703 -0.7699 0
## fYear2014 -0.5083 0.1046 -0.7136 -0.5084 -0.3031 -0.5084 0
## fYear2015 -0.3547 0.1020 -0.5549 -0.3547 -0.1545 -0.3548 0
## fYear2016 -0.4652 0.1010 -0.6634 -0.4652 -0.2670 -0.4653 0
## fYear2017 -0.3314 0.1429 -0.6117 -0.3314 -0.0508 -0.3316 0
## fSurveyNSIBTS -1.8007 0.1128 -2.0224 -1.8007 -1.5795 -1.8006 0
## lsurface 0.9849 0.1263 0.7372 0.9848 1.2328 0.9846 0
##
## Random effects:
## Name Model
## w SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## size for the nbinomial observations (1/overdispersion) 0.7399 0.0288 0.6849 0.7393 0.7982 0.7381
## Theta1 for w 2.0268 0.0789 1.8743 2.0256 2.1844 2.0212
## Theta2 for w -4.1347 0.1297 -4.3979 -4.1309 -3.8884 -4.1172
##
## Expected number of effective parameters(std dev): 325.50(15.28)
## Number of equivalent replicates : 29.55
##
## Deviance Information Criterion (DIC) ...: 18850.44
## Effective number of parameters .........: 318.07
##
## Watanabe-Akaike information criterion (WAIC) ...: 18886.57
## Effective number of parameters .................: 315.43
##
## Marginal log-Likelihood: -9719.29
## Posterior marginals for linear predictor and fitted values computed
We’ll plot the spatial mean and sd, but this time with a grid size of 1 by 1 km. Because the coordinates are in km, setting the number of grid cells equal to the (rounded) coordinate ranges gives cells of approximately 1 by 1 km.
# we want to make cells of 1 by 1 km. Let's do this first
xl <- range(Loc[,1])
xres <- round(xl[2]-xl[1])
yl <- range(Loc[,2])
yres <- round(yl[2]-yl[1])
wproj <- inla.mesh.projector(mesh1a, dims=c(xres,yres), xlim = xl, ylim = yl)
wm.pm100100 <- inla.mesh.project(wproj, I1nb$summary.random$w$mean)
wsd.pm100100 <- inla.mesh.project(wproj, I1nb$summary.random$w$sd)
grid <- expand.grid(x = wproj$x, y = wproj$y)
grid$zm <- as.vector(wm.pm100100)
grid$zsd <- as.vector(wsd.pm100100)
wld <- map('world', xlim=c(-5,15), ylim=c(47,62),plot=FALSE)
wld <- data.frame(lon=wld$x, lat=wld$y)
UTMmap <- project(cbind(wld$lon, wld$lat), "+proj=utm +zone=31 +ellps=WGS84")
UTMmapFinal <- data.frame("xm" =UTMmap[,1]/1e3, "ym" = UTMmap[,2]/1e3)
p1 <- levelplot(zm ~ x * y ,
data = grid,
scales = list(draw = TRUE),
xlab = list("Easting", cex = 1),
ylab = list("Northing", cex = 1),
main = list("Posterior mean spatial random fields", cex = 1),
col.regions=tim.colors(25, alpha = 1),
panel=function(x, y, z, subscripts,...){
panel.levelplot(x, y, z, subscripts,...)
grid.points(x = cpue_subset$Xkm,
y = cpue_subset$Ykm,
pch = 1,
size = unit(cpue_subset$NoPerHaul/15, "char"))
}) + xyplot(ym~ xm, UTMmapFinal, type='l', lty=1, lwd=0.5, col='black')
p2 <- levelplot(zsd ~ x * y,
data = grid,
scales = list(draw = TRUE),
xlab = list("Easting", cex = 1),
ylab = list("Northing", cex = 1),
main = list("Posterior sd spatial random fields", cex = 1),
col.regions=tim.colors(25, alpha = 1),
panel=function(x, y, z, subscripts,...){
panel.levelplot(x, y, z, subscripts,...)
}) + xyplot(ym~ xm, UTMmapFinal, type='l', lty=1, lwd=0.5, col='black')
grid.arrange(p1,p2, ncol=2)
Next we want to simulate a number of realizations and integrate over the surface (per year) so that we get a population-level estimate.
# Simulate regression parameters using inla.posterior.sample
# to simulate from the model. The output is stored in the Sim object.
set.seed(1234)
NSim <- 1000
Sim <- inla.posterior.sample(n = NSim, result = I1nb)
In this example, we thus sample 1000 times. The Sim object is a list of length NSim. Each element of this list contains a single realization. Let’s have a closer look.
names(Sim[[1]])
## [1] "hyperpar" "latent" "logdens"
#get out names of different types of rows in latent
rnames <- rownames(Sim[[1]]$latent)
rtypes <- unique(unlist(lapply(strsplit(rnames,":"),function(x){x[1]})))
rtypes
## [1] "APredictor" "Predictor" "w" "Intercept" "fYear2006" "fYear2007" "fYear2008"
## [8] "fYear2009" "fYear2010" "fYear2011" "fYear2012" "fYear2013" "fYear2014" "fYear2015"
## [15] "fYear2016" "fYear2017" "fSurveyNSIBTS" "lsurface"
There are thus a number of different types of rows in latent: APredictor, Predictor, w, the intercept, and the fixed effects.
#to get the w par realizations out:
rtypes[3]
## [1] "w"
wrownum <- grep(paste0("^",rtypes[3]),rownames(Sim[[1]]$latent))
wmat <- sapply(Sim, function(x) {x$latent[wrownum]})
dim(wmat)
## [1] 1662 1000
#to get the fixed effect par realizations out
fixed <- rtypes[-(1:3)]
lrownum <- unlist(lapply(fixed, function(x) {grep(x, rownames(Sim[[1]]$latent), fixed = TRUE)} ))
linmat <- sapply(Sim, function(x) {x$latent[lrownum]})
dim(linmat)
## [1] 15 1000
dimnames(linmat)[[1]] <- fixed
Sim is thus an object that has …
#NOTE THAT WE SHOULD THINK ABOUT how distance works in model because of log link
#NOTE THAT YEARS DO NOT WORK YET HERE (NEED MORE WORK ON fixed effects)
numbers <- matrix(NA, ncol=length( startyr:2017), nrow=NSim, dimnames=list("sim"=1:NSim,"year"=startyr:2017))
for (ii in startyr:2017){
for (jj in 1:NSim){
#intercept + year effect; lsurface is multiplied by 0, i.e. lsurface = log(1), so we predict for a surface of 1 km2
if (ii==startyr){
lin <- linmat["Intercept",jj] + 0 * linmat["lsurface",jj]
}else{
lin <- linmat["Intercept",jj] + linmat[paste0("fYear",ii),jj] + 0 * linmat["lsurface",jj]
}
#commented out section is for including the spatial corr, but given that the sum of this field ==0, adding it would not make a difference
# wm.pm100100 <- inla.mesh.project(wproj, wmat[,jj])
#res <- exp(lin + wm.pm100100 )
#numbers[jj,as.character(ii)] <- sum(res,na.rm=T)
res <- exp(lin)
numbers[jj,as.character(ii)] <- res
}
}
numbers now contains the estimated numbers per km2 (excluding the spatial effect). What are the confidence bounds for the numbers per year?
qnumbers <- apply(numbers,2, quantile,c(0.025,0.5,0.975))
matplot(t(qnumbers), xlab="Years", ylab="Numbers (km-2)")
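The apply() call above computes column-wise quantiles; the shape of the result can be seen on a toy "simulations × years" matrix (made-up numbers).

```r
# Column-wise quantiles: one column per "year", one row per quantile.
m <- matrix(1:8, nrow = 4)                      # 4 simulations, 2 years
q <- apply(m, 2, quantile, c(0.025, 0.5, 0.975))
q                                               # a 3 x 2 matrix
```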
What are the average weights per year? To go from numbers to weight we need the alpha and beta parameters of the length–weight relationship (length in cm, resulting weight W in g): alpha = 0.156650, beta = 2.190 (Bedford et al. 1986).
alpha <- 0.156650
beta <- 2.190
full_cpue$wt <- (alpha * (full_cpue$LngtClass/10)^beta * full_cpue$NoPerHaul) /1000
annualmnwt <- aggregate(cbind(wt,NoPerHaul) ~ Year, data= full_cpue, FUN= "sum")
annualmnwt <- within(annualmnwt, mnwt <- wt/NoPerHaul)
plot(mnwt~Year, data=annualmnwt, type="b")
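As a quick sanity check of the length–weight relation W = alpha * L^beta (L in cm, W in g), here is the weight of a hypothetical 25 cm individual:

```r
# Length-weight relation with the Bedford et al. (1986) parameters.
alpha <- 0.156650
beta  <- 2.190
W <- alpha * 25^beta   # weight in grams of a hypothetical 25 cm fish
W / 1000               # weight in kg
```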
Multiply the mean weights by the numbers per km2 to get kg per km2.
qwts <- annualmnwt[annualmnwt$Year >= startyr,]$mnwt * qnumbers
Now that we have kg per km2, we need to multiply by the surface of the North Sea. For now we assume a total surface of 750000 km2, but this is probably an overestimate for our purpose, because not everything is sampled. We should instead use the mesh, over which the spatial field sums to zero, but which at least gives us a proper surface. Divide by 1000 to go from kg to tonnes.
qwtsNS <- (qwts * 750000)/1000
For comparison, we also fit the model without the spatial field; for this we only need a stack without w.
N <- nrow(cpue_subset)
Stack1 <- inla.stack(
tag = "Fit",
data = list(y = cpue_subset$NoPerHaul),
A = list(1,1),
effects = list(
Intercept=rep(1,N),
X=X )) #Covariates
## Note: method with signature 'diagonalMatrix#sparseMatrix' chosen for function 'cbind2',
## target signature 'ddiMatrix#ddiMatrix'.
## "sparseMatrix#diagonalMatrix" would also be valid
fsp <- parse(text=c("y ~ -1 + Intercept + ",paste(c(names(X)),collapse =" + ")))
INLA:::inla.dynload.workaround()
I1p <- inla(eval(fsp), family = "poisson", data=inla.stack.data(Stack1),
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(A = inla.stack.A(Stack1)))
I1nb <- inla(eval(fsp), family = "nbinomial", data=inla.stack.data(Stack1),
control.compute = list(dic = TRUE, waic = TRUE, config=TRUE),
control.predictor = list(A = inla.stack.A(Stack1)))
Check the results.
summary(I1p)
##
## Call:
## c("inla(formula = eval(fsp), family = \"poisson\", data = inla.stack.data(Stack1), ", " control.compute = list(dic = TRUE, waic = TRUE), control.predictor = list(A = inla.stack.A(Stack1)))" )
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.3124 5.8973 0.9951 7.2048
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept 4.4888 0.1805 4.1355 4.4884 4.8439 4.4876 0
## fYear2006 -0.0856 0.0383 -0.1609 -0.0856 -0.0106 -0.0855 0
## fYear2007 0.4841 0.0349 0.4156 0.4841 0.5525 0.4840 0
## fYear2008 0.3015 0.0362 0.2305 0.3015 0.3725 0.3015 0
## fYear2009 -0.1146 0.0376 -0.1885 -0.1146 -0.0409 -0.1145 0
## fYear2010 -0.0092 0.0387 -0.0852 -0.0091 0.0667 -0.0091 0
## fYear2011 0.0105 0.0386 -0.0654 0.0105 0.0861 0.0106 0
## fYear2012 -0.1192 0.0371 -0.1920 -0.1192 -0.0465 -0.1191 0
## fYear2013 -0.3349 0.0396 -0.4128 -0.3348 -0.2572 -0.3347 0
## fYear2014 -0.4276 0.0431 -0.5125 -0.4275 -0.3434 -0.4273 0
## fYear2015 -0.3723 0.0401 -0.4512 -0.3723 -0.2938 -0.3722 0
## fYear2016 -0.4715 0.0407 -0.5517 -0.4715 -0.3918 -0.4713 0
## fYear2017 -0.5392 0.0774 -0.6938 -0.5384 -0.3895 -0.5367 0
## fSurveyNSIBTS -2.1474 0.0452 -2.2364 -2.1473 -2.0590 -2.1472 0
## lsurface 0.8524 0.0505 0.7536 0.8523 0.9518 0.8520 0
##
## The model has no random effects
##
## The model has no hyperparameters
##
## Expected number of effective parameters(std dev): 15.09(0.00)
## Number of equivalent replicates : 637.25
##
## Deviance Information Criterion (DIC) ...: 61281.62
## Effective number of parameters .........: 15.09
##
## Watanabe-Akaike information criterion (WAIC) ...: 61562.36
## Effective number of parameters .................: 291.12
##
## Marginal log-Likelihood: -30731.07
## Posterior marginals for linear predictor and fitted values computed
summary(I1nb)
##
## Call:
## c("inla(formula = eval(fsp), family = \"nbinomial\", data = inla.stack.data(Stack1), ", " control.compute = list(dic = TRUE, waic = TRUE, config = TRUE), ", " control.predictor = list(A = inla.stack.A(Stack1)))")
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.4282 12.0632 1.1757 13.6671
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept 3.6037 0.5404 2.5365 3.6057 4.6588 3.6098 0
## fYear2006 -0.1369 0.1376 -0.4067 -0.1370 0.1333 -0.1372 0
## fYear2007 0.3080 0.1396 0.0346 0.3077 0.5825 0.3073 0
## fYear2008 0.0637 0.1391 -0.2088 0.0635 0.3371 0.0632 0
## fYear2009 -0.2596 0.1378 -0.5298 -0.2597 0.0109 -0.2599 0
## fYear2010 -0.3640 0.1394 -0.6373 -0.3642 -0.0902 -0.3644 0
## fYear2011 -0.2922 0.1393 -0.5653 -0.2924 -0.0187 -0.2927 0
## fYear2012 -0.3276 0.1364 -0.5953 -0.3276 -0.0600 -0.3277 0
## fYear2013 -0.6097 0.1383 -0.8812 -0.6098 -0.3384 -0.6098 0
## fYear2014 -0.6377 0.1427 -0.9174 -0.6379 -0.3572 -0.6383 0
## fYear2015 -0.4450 0.1396 -0.7189 -0.4450 -0.1710 -0.4451 0
## fYear2016 -0.5671 0.1371 -0.8364 -0.5671 -0.2982 -0.5671 0
## fYear2017 -0.7169 0.1895 -1.0834 -0.7188 -0.3396 -0.7226 0
## fSurveyNSIBTS -1.8901 0.1381 -2.1605 -1.8904 -1.6185 -1.8909 0
## lsurface 0.5541 0.1520 0.2536 0.5547 0.8506 0.5560 0
##
## The model has no random effects
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## size for the nbinomial observations (1/overdispersion) 0.147 0.0037 0.1398 0.1469 0.1544 0.1468
##
## Expected number of effective parameters(std dev): 15.01(3e-04)
## Number of equivalent replicates : 640.97
##
## Deviance Information Criterion (DIC) ...: 24037.20
## Effective number of parameters .........: 15.97
##
## Watanabe-Akaike information criterion (WAIC) ...: 24042.22
## Effective number of parameters .................: 20.32
##
## Marginal log-Likelihood: -12111.31
## Posterior marginals for linear predictor and fitted values computed
We have depth for many, but not all, hauls in the full CPUE set. The missing values are indicated by a value of -9, and these have to be transformed into NAs.
nrow(cpue_subset)
## [1] 9619
summary(cpue_subset$Depth)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.00 38.00 60.00 69.87 93.00 415.00
cpue_subset[cpue_subset$Depth ==-9,]$Depth <- NA
summary(cpue_subset$Depth)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.00 38.00 60.00 69.98 93.00 415.00 13
Given that the depth at a given location does not change much over the years, we use a generalized additive model to build a depth map and to predict depths for those hauls where depth is missing. Using depth in INLA requires that the depth is binned; we use 2 m depth bins.
depthmodel <- gam(Depth~te( ShootLong,ShootLat,k=17), data=cpue_subset)
summary(depthmodel)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Depth ~ te(ShootLong, ShootLat, k = 17)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.97814 0.07877 888.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## te(ShootLong,ShootLat) 264.1 264.5 955.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 285/289
## R-sq.(adj) = 0.963 Deviance explained = 96.4%
## GCV = 61.295 Scale est. = 59.603 n = 9606
gam.check(depthmodel)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 21 iterations.
## The RMS GCV score gradient at convergence was 0.0006764948 .
## The Hessian was positive definite.
## Model rank = 285 / 289
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## te(ShootLong,ShootLat) 288.000 264.130 0.422 0
plot(depthmodel)
cpue_subset[is.na(cpue_subset$Depth),]$Depth <-predict(depthmodel,newdata= cpue_subset[is.na(cpue_subset$Depth),])
cpue_subset$Depth <- round(cpue_subset$Depth/2,0)*2
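The binning trick above, round(x / 2) * 2, snaps each value to the nearest multiple of 2; a toy check with made-up depths:

```r
# Snap depths to 2 m bins.
depths <- c(10.4, 11.7, 37.3)
round(depths / 2, 0) * 2   # 10 12 38
```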
N <- nrow(cpue_subset)
Stack1 <- inla.stack(
tag = "Fit",
data = list(y = cpue_subset$NoPerHaul),
A = list(1,1,1, A1),
effects = list(
Intercept=rep(1,N),
X=X, #Covariates
Depth = cpue_subset$Depth,
w = w.st)) #Spatial field
#6. Specify the model formula
fsp <- parse(text=c("y ~ -1 + Intercept + ",paste(c(names(X)," f(w, model = spde)", ' f(Depth,model="rw2")'),collapse =" + ")))
INLA:::inla.dynload.workaround()
I1p <- inla(eval(fsp), family = "poisson", data=inla.stack.data(Stack1),
control.compute = list(dic = TRUE, waic = TRUE, openmp.strategy="large"),
control.predictor = list(A = inla.stack.A(Stack1)))
I1nb <- inla(eval(fsp), family = "nbinomial", data=inla.stack.data(Stack1),
control.compute = list(dic = TRUE, waic = TRUE, config=TRUE, openmp.strategy="large"),
control.predictor = list(A = inla.stack.A(Stack1)))
summary(I1p)
##
## Call:
## c("inla(formula = eval(fsp), family = \"poisson\", data = inla.stack.data(Stack1), ", " control.compute = list(dic = TRUE, waic = TRUE, openmp.strategy = \"large\"), ", " control.predictor = list(A = inla.stack.A(Stack1)))")
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.6445 203.1507 1.3772 205.1723
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept -0.4331 1.1288 -2.7534 -0.3332 1.4500 -0.0627 0
## fYear2006 -0.1897 0.0419 -0.2722 -0.1897 -0.1075 -0.1896 0
## fYear2007 0.1510 0.0379 0.0767 0.1510 0.2253 0.1510 0
## fYear2008 -0.0821 0.0393 -0.1592 -0.0821 -0.0051 -0.0821 0
## fYear2009 -0.2211 0.0414 -0.3024 -0.2211 -0.1400 -0.2210 0
## fYear2010 -0.3520 0.0424 -0.4355 -0.3520 -0.2689 -0.3519 0
## fYear2011 -0.3862 0.0416 -0.4680 -0.3862 -0.3046 -0.3861 0
## fYear2012 -0.2605 0.0402 -0.3394 -0.2605 -0.1817 -0.2605 0
## fYear2013 -0.6260 0.0440 -0.7125 -0.6260 -0.5399 -0.6259 0
## fYear2014 -0.6277 0.0471 -0.7206 -0.6276 -0.5355 -0.6275 0
## fYear2015 -0.4650 0.0433 -0.5501 -0.4649 -0.3802 -0.4648 0
## fYear2016 -0.6207 0.0441 -0.7074 -0.6206 -0.5344 -0.6205 0
## fYear2017 -0.6342 0.0809 -0.7953 -0.6334 -0.4775 -0.6318 0
## fSurveyNSIBTS -1.9174 0.0584 -2.0325 -1.9172 -1.8032 -1.9169 0
## lsurface 0.8700 0.0661 0.7408 0.8698 1.0003 0.8694 0
##
## Random effects:
## Name Model
## w SPDE2 model
## Depth RW2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Theta1 for w 1.129 0.0651 1.004 1.128 1.26 1.123
## Theta2 for w -3.285 0.0919 -3.471 -3.283 -3.11 -3.275
## Precision for Depth 762.988 651.9957 179.018 571.578 2487.53 363.685
##
## Expected number of effective parameters(std dev): 644.06(12.99)
## Number of equivalent replicates : 14.93
##
## Deviance Information Criterion (DIC) ...: 1.254e+34
## Effective number of parameters .........: 6.269e+33
##
## Watanabe-Akaike information criterion (WAIC) ...: 26120.79
## Effective number of parameters .................: 1977.49
##
## Marginal log-Likelihood: -12813.98
## Posterior marginals for linear predictor and fitted values computed
summary(I1nb)
##
## Call:
## c("inla(formula = eval(fsp), family = \"nbinomial\", data = inla.stack.data(Stack1), ", " control.compute = list(dic = TRUE, waic = TRUE, config = TRUE, ", " openmp.strategy = \"large\"), control.predictor = list(A = inla.stack.A(Stack1)))" )
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.7022 254.4239 1.9896 257.1157
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept 2.2184 0.7994 0.6183 2.2249 3.7809 2.2383 0
## fYear2006 -0.1694 0.0976 -0.3609 -0.1694 0.0220 -0.1694 0
## fYear2007 0.0368 0.0959 -0.1514 0.0368 0.2249 0.0368 0
## fYear2008 -0.0530 0.0976 -0.2447 -0.0531 0.1386 -0.0531 0
## fYear2009 -0.4771 0.1004 -0.6743 -0.4770 -0.2801 -0.4770 0
## fYear2010 -0.6733 0.1015 -0.8726 -0.6733 -0.4744 -0.6732 0
## fYear2011 -0.6453 0.1002 -0.8421 -0.6453 -0.4489 -0.6452 0
## fYear2012 -0.4224 0.0985 -0.6159 -0.4224 -0.2292 -0.4224 0
## fYear2013 -0.7712 0.1020 -0.9716 -0.7712 -0.5713 -0.7711 0
## fYear2014 -0.5216 0.1049 -0.7275 -0.5217 -0.3158 -0.5217 0
## fYear2015 -0.3551 0.1023 -0.5559 -0.3551 -0.1543 -0.3552 0
## fYear2016 -0.4764 0.1012 -0.6750 -0.4765 -0.2779 -0.4765 0
## fYear2017 -0.3379 0.1435 -0.6193 -0.3379 -0.0562 -0.3381 0
## fSurveyNSIBTS -1.8284 0.1126 -2.0495 -1.8283 -1.6077 -1.8282 0
## lsurface 0.9922 0.1263 0.7445 0.9921 1.2401 0.9920 0
##
## Random effects:
## Name Model
## w SPDE2 model
## Depth RW2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## size for the nbinomial observations (1/overdispersion) 0.7327 2.860e-02 0.6785 0.7319 0.7909 0.7303
## Theta1 for w 2.1312 8.960e-02 1.9574 2.1300 2.3095 2.1261
## Theta2 for w -4.1179 1.367e-01 -4.3933 -4.1148 -3.8566 -4.1038
## Precision for Depth 31706.0003 1.916e+04 8551.4532 27295.6559 81087.7025 19782.0348
##
## Expected number of effective parameters(std dev): 309.54(16.49)
## Number of equivalent replicates : 31.08
##
## Deviance Information Criterion (DIC) ...: 18865.27
## Effective number of parameters .........: 304.27
##
## Watanabe-Akaike information criterion (WAIC) ...: 18904.41
## Effective number of parameters .................: 306.04
##
## Marginal log-Likelihood: -9588.22
## Posterior marginals for linear predictor and fitted values computed
#plot depth smoother
Depthm <- I1nb$summary.random$Depth
par(mfrow = c(1,1), mar = c(5,5,2,2), cex.lab = 1.5)
plot(Depthm[,1:2], type='l',
xlab = 'Depth',
ylab = 'Smoother',
ylim = c(-2, 2) )
abline(h=0, lty=3)
lines(Depthm[, c(1, 4)], lty=2)
lines(Depthm[, c(1, 6)], lty=2)
rug(cpue_subset$Depth)
# we want to make cells of 1 by 1 km. Let's do this first
xl <- range(Loc[,1])
xres <- round(xl[2]-xl[1])
yl <- range(Loc[,2])
yres <- round(yl[2]-yl[1])
wproj <- inla.mesh.projector(mesh1a, dims=c(xres,yres), xlim = xl, ylim = yl)
wm.pm100100 <- inla.mesh.project(wproj, I1nb$summary.random$w$mean)
wsd.pm100100 <- inla.mesh.project(wproj, I1nb$summary.random$w$sd)
grid <- expand.grid(x = wproj$x, y = wproj$y)
grid$zm <- as.vector(wm.pm100100)
grid$zsd <- as.vector(wsd.pm100100)
wld <- map('world', xlim=c(-5,15), ylim=c(47,62),plot=FALSE)
wld <- data.frame(lon=wld$x, lat=wld$y)
UTMmap <- project(cbind(wld$lon, wld$lat), "+proj=utm +zone=31 +ellps=WGS84")
UTMmapFinal <- data.frame("xm" =UTMmap[,1]/1e3, "ym" = UTMmap[,2]/1e3)
p1 <- levelplot(zm ~ x * y ,
data = grid,
scales = list(draw = TRUE),
xlab = list("Easting", cex = 1),
ylab = list("Northing", cex = 1),
main = list("Posterior mean spatial random fields", cex = 1),
col.regions=tim.colors(25, alpha = 1),
panel=function(x, y, z, subscripts,...){
panel.levelplot(x, y, z, subscripts,...)
grid.points(x = cpue_subset$Xkm,
y = cpue_subset$Ykm,
pch = 1,
size = unit(cpue_subset$NoPerHaul/15, "char"))
}) + xyplot(ym~ xm, UTMmapFinal, type='l', lty=1, lwd=0.5, col='black')
p2 <- levelplot(zsd ~ x * y,
data = grid,
scales = list(draw = TRUE),
xlab = list("Easting", cex = 1),
ylab = list("Northing", cex = 1),
main = list("Posterior sd spatial random fields", cex = 1),
col.regions=tim.colors(25, alpha = 1),
panel=function(x, y, z, subscripts,...){
panel.levelplot(x, y, z, subscripts,...)
}) + xyplot(ym~ xm, UTMmapFinal, type='l', lty=1, lwd=0.5, col='black')
grid.arrange(p1,p2, ncol=2)
# note: SimData, rownum, Xm, Am, mu.i, Ysim and LP are assumed to exist from
# an earlier simulation step
for (i in 1:NSim){
Betas <- SimData[[i]]$latent[as.numeric(rownum)]
wk <- SimData[[i]]$latent[2442:3633]
FixedPart <- Xm %*% Betas
SpatialPart <- Am %*% wk
mu.i[,i] <- exp(FixedPart + SpatialPart)
Ysim[,i] <- rpois(n = nrow(LP), lambda = mu.i[,i])
}
table(Ysim[,1])
table(Ysim[,2])
table(Ysim[,3])
# Don't try to understand all the fancy code below. Just the principle is enough.
# RUN FROM HERE.....
Z <- matrix(nrow = max(Ysim)+1, ncol = NSim)
for (i in 1: NSim){
zi <- table(Ysim[,i])
I <- as.numeric(names(zi)) + 1
Z[I,i] <- zi
}
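The principle of the loop above, on a tiny self-contained example: row r of Z stores how often the value r - 1 occurs in each simulated data set.

```r
set.seed(7)
Ysim_toy <- matrix(rpois(20, lambda = 1), nrow = 10, ncol = 2)
Z_toy <- matrix(0, nrow = max(Ysim_toy) + 1, ncol = 2)
for (i in 1:2) {
  zi <- table(Ysim_toy[, i])
  Z_toy[as.numeric(names(zi)) + 1, i] <- zi   # value v goes into row v + 1
}
colSums(Z_toy)  # each column sums to the 10 observations per data set
```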
par(mfrow = c(1,1), mar = c(5,5,2,2), cex.lab = 1.5)
Z[is.na(Z)] <- 0
Xi <- 0: max(Ysim)
AverageTable <- rowSums(Z) / NSim   # same as apply(Z, 1, mean)
SD <- apply(Z, 1, sd)
Zs <- table(LP$nSIE)
nx <- length(Zs)
NamesZ <- as.numeric(names(Zs))
plot(x = Xi, y = AverageTable, type = "h", lwd = 5, xlab = "Simulated endemic richness values", ylab = "Frequencies", ylim = c(0, 250))
# Add SDs
for (i in 1:nx){
segments(x0 = NamesZ[i], x1 = NamesZ[i], y0 = AverageTable[i], y1 = AverageTable[i] + SD[i],lwd = 15,col = 1)
}
#And add the table for the observed data
for (i in 1:nx){
segments(x0 = NamesZ[i] + 0.2, x1 = NamesZ[i] + 0.2, y0 = 0, y1 = Zs[i], lwd = 2, col = 2)
}
# The red bars represent the frequency table for the observed data. The black lines are
# the average frequency table for the 1000 simulated data sets. Hopefully they
# match; if they do, the model simulates data that are comparable to the observed data.
# It is not perfect: we are predicting too many zeros, but not enough ones and twos.
#############################################
# Now that we have 1000 simulated data sets from the model, what else shall we do with these simulated
# data sets?
# We could calculate the number of zeros in each of the 1,000 data sets.
zeros <- vector(length = NSim)
for(i in 1:NSim){
zeros[i] <- sum(Ysim[,i] == 0)
}
table(zeros)
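The zero-counting loop can also be written in one vectorised line; a toy check that both give the same answer:

```r
set.seed(1)
Ysim_toy <- matrix(rpois(500, lambda = 0.5), nrow = 5, ncol = 100)
zeros_loop <- vector(length = 100)
for (i in 1:100) zeros_loop[i] <- sum(Ysim_toy[, i] == 0)
zeros_vec <- colSums(Ysim_toy == 0)   # one column sum per simulated data set
all(zeros_loop == zeros_vec)  # TRUE
```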
# From the 1,000 simulated data sets, in 2 simulated data sets we had 1141 zeros, in 2
# simulated data sets we had 1148 zeros, etc. Your results will be different from mine.
# Just type
Ysim[,1]
Ysim[,2]
Ysim[,3]
#etc
#Let's plot this as a table
plot(table(zeros),
#axes = FALSE,
xlab = "Number of zeros per simulated data set",
ylab = "How often this number of zeros occurs in 1000 simulated data sets",
xlim = c(50, 150),
main = "Simulation results")
points(x = sum(LP$nSIE == 0), y = 0, pch = 16, cex = 5, col = 2)
# The red dot is the number of zeros in the original data set.
# The data simulated from the Poisson model contain too many zeros.
#########################################
#########################################
# What else can we do with the 1,000 simulated data sets? We can also calculate the dispersion statistic for each
# simulated data set
p <- length(Betas) + 2 #Number of regression parameters
Dispersion <- vector(length = NSim) #Create space
# Calculate the dispersion for each simulated data set
for(i in 1:NSim){
e2 <- (Ysim[,i] - mu2) /sqrt(mu2)
Dispersion[i] <- sum(e2^2) / (N - p)
}
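A quick sanity check of the Pearson dispersion statistic used above: for data truly generated from a Poisson model it should be close to 1 (toy example, ignoring the small correction for the number of parameters p):

```r
set.seed(123)
mu_true <- rep(4, 2000)
y_toy   <- rpois(2000, lambda = mu_true)
e_toy   <- (y_toy - mu_true) / sqrt(mu_true)   # Pearson residuals
sum(e_toy^2) / (2000 - 1)                      # close to 1 for Poisson data
```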
# Plot this as a table
hist(Dispersion,
xlab = "Dispersion",
ylab = "Frequency",
xlim = c(0.6, 1.5),
breaks = 25)
# And visualize the dispersion for the original Poisson GLMM
points(x = Dispersion2,
y = 0,
pch = 16,
cex = 3,
col = 2)
# The red dot is the dispersion statistic of the original model.
# We have serious underdispersion!
# Bugger.
# A Generalized Poisson model may work.
#
# Simulate from model I1nb
# Simulation study spatial Negbin model.
# First set random seed and the number of simulations
set.seed(12345)
NSim <- 1000
# Carry out the simulation
Sim <- inla.posterior.sample(n = NSim, result = I1nb)
# Now we have 1000 simulated betas and also 1000 spatial fields w
# Use these to calculate 1000 times the expected values mu,
# and simulate count data from these using rpois
# X matrix
#Xm <- model.matrix(~BreedingD.std + Rain.std, data = cpue_subset)
Xm <- model.matrix(~ fYear + fSurvey + surface , data = cpue_subset)
# Create space to store the information
N <- nrow(cpue_subset)
Ysim <- matrix(nrow = N, ncol = NSim)
mu.i <- matrix(nrow = N, ncol = NSim)
Xm <- as.matrix(Xm)
Am <- as.matrix(A1)
# We are calculating a Fixed part (X * beta) and also the random part (A * w)
# NB: the latent-field indices below are specific to this particular fit;
# check rownames(Sim[[1]]$latent) to locate the fixed effects (one per
# column of Xm) and the spatial field w before re-using them
for (i in 1:NSim){
  Betas <- Sim[[i]]$latent[c(987, 988, 989)]
  wk <- Sim[[i]]$latent[416:986]
  eta <- Xm %*% Betas + Am %*% wk
  mu.i[,i] <- exp(eta)
  Ysim[,i] <- rpois(n = N, lambda = mu.i[,i])
}
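Note that the loop draws with rpois even though I1nb is a negative binomial model; drawing with rnbinom (mu/size parameterisation, with size taken from the posterior of the NB size parameter) would also reproduce the overdispersion. A toy illustration of the difference, using a size close to the posterior mean (about 0.73) reported above:

```r
set.seed(42)
mu_toy <- 5
k_toy  <- 0.73                        # NB size; smaller size = more overdispersion
y_pois <- rpois(1e4, lambda = mu_toy)
y_nb   <- rnbinom(1e4, mu = mu_toy, size = k_toy)
c(var_pois = var(y_pois), var_nb = var(y_nb))
# theoretical variances: mu = 5 (Poisson) vs mu + mu^2/size, about 39 (NB)
```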
# Now we have 1,000 simulated count data sets from the model
# And plot the results of the simulation study
Z <- matrix(nrow = max(Ysim)+1, ncol = NSim)
for (i in 1: NSim){
zi <- table(Ysim[,i])
I <- as.numeric(names(zi)) + 1
Z[I,i] <- zi
}